23 research outputs found

Artificial Intelligence and Statistics

Author: Kumbier Karl
Yu Bin
Publication date: 07/12/2017

Artificial intelligence (AI) is intrinsically data-driven. It calls for the application of statistical concepts through human-machine collaboration during generation of data, development of algorithms, and evaluation of results. This paper discusses how such human-machine collaboration can be approached through the statistical concepts of population, question of interest, representativeness of training data, and scrutiny of results (PQRS). The PQRS workflow provides a conceptual framework for integrating statistical ideas with human input into AI products and research. These ideas include experimental design principles of randomization and local control as well as the principle of stability to gain reproducibility and interpretability of algorithms and data results. We discuss the use of these principles in the contexts of self-driving cars, automated medical diagnoses, and examples from the authors' collaborative research.

arXiv.org e-Print Archive

eScholarship - University of California

Veridical Data Science

Author: Kumbier Karl
Yu Bin
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 12/11/2019

Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, comprised of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle for the data science life cycle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. Moreover, we develop inference procedures that build on PCS, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others and compare it to existing methods in high dimensional, sparse linear model simulations. Over a wide range of misspecified simulation models, PCS inference demonstrates favorable performance in terms of ROC curves. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo
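The idea of a perturbation interval mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' PCS procedure: it omits the prediction screening step, and the function name `perturbation_interval`, the toy data, and the choice of mean and median as "models" are all assumptions made for illustration. It only shows how one might collect a target estimate under data perturbations (bootstrap resampling) and model perturbations (alternative estimators), then report a percentile range.

```python
import random
import statistics

def perturbation_interval(data, estimators, n_boot=200, lower=5, upper=95, seed=0):
    """Percentile range of an estimate across data and model perturbations.

    Hypothetical helper for illustration only; not the PCS reference code.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Data perturbation: resample the observations with replacement.
        sample = [rng.choice(data) for _ in data]
        # Model perturbation: apply every candidate estimator to the sample.
        for estimator in estimators:
            estimates.append(estimator(sample))
    estimates.sort()

    def percentile(p):
        # Nearest-rank style lookup into the sorted estimates.
        return estimates[round(p / 100 * (len(estimates) - 1))]

    return percentile(lower), percentile(upper)

# Toy measurements (made up for this sketch).
data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.0, 2.1, 1.9]
low, high = perturbation_interval(data, [statistics.mean, statistics.median])
print(f"perturbation interval: [{low:.3f}, {high:.3f}]")
```

A wide interval would suggest that the result is sensitive to the analyst's choices; a narrow one suggests it is stable under the perturbations that were tried.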

arXiv.org e-Print Archive

Iterative Random Forests to detect predictive and stable high-order interactions

Author: Basu Sumanta
Brown James B.
Kumbier Karl
Yu Bin
Publication date: 23/12/2017

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
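The stability criterion quoted in this abstract (an interaction counts as stable when it is returned in more than half of bootstrap replicates) can be sketched in a few lines. The function name `stable_interactions`, the feature labels, and the replicate contents below are made up for illustration; the sketch covers only the thresholding step and not the forest fitting or interaction search performed by iRF itself.

```python
from collections import Counter

def stable_interactions(replicates, threshold=0.5):
    """Map each interaction to its stability score, keeping scores > threshold.

    `replicates` holds one collection of recovered interactions per bootstrap
    replicate. Hypothetical helper for illustration only.
    """
    counts = Counter()
    for recovered in replicates:
        # Count an interaction at most once per replicate.
        counts.update(set(recovered))
    n = len(replicates)
    return {inter: c / n for inter, c in counts.items() if c / n > threshold}

# Toy output of four bootstrap replicates (feature names are placeholders).
replicates = [
    {frozenset({"x1", "x2"}), frozenset({"x3", "x4"})},
    {frozenset({"x1", "x2"})},
    {frozenset({"x1", "x2"}), frozenset({"x3", "x4"})},
    {frozenset({"x5", "x6"})},
]
scores = stable_interactions(replicates)
print(scores)
```

In this toy run the pair {x1, x2} appears in three of four replicates and is kept, while {x3, x4} appears in exactly half and is dropped because the criterion is strict.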

arXiv.org e-Print Archive

Crossref

University of Birmingham Research Portal

eScholarship - University of California

Refining interaction search through signed iterative Random Forests

Author: Basu Sumanta
Brown James B.
Celniker Susan
Kumbier Karl
Yu Bin
Publication date: 16/10/2018

Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically black boxes, learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest algorithm (iRF) took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the predictive accuracy of Random Forests (RF). Here we refine the interactions identified by iRF to explicitly map responses as a function of interacting features. Our method, signed iRF (s-iRF), describes subsets of rules that frequently occur on RF decision paths. We refer to these rule subsets as signed interactions. Signed interactions share not only the same set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We describe stable and predictive importance metrics (SPIMs) to rank signed interactions. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate our proposed approach in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of enhancer activity, s-iRF recovers one of the few experimentally validated high-order interactions and suggests novel enhancer elements where this interaction may be active. In the case of spatial gene expression patterns, s-iRF recovers all 11 reported links in the gap gene network. By refining the process of interaction recovery, our approach has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.
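To make the notion of a signed interaction concrete, here is a minimal sketch under stated assumptions. Each decision path is represented as a list of (feature, sign) pairs, where "+" is assumed to mean the path follows the branch above the split threshold and "-" the branch below it. The function `prevalence` and the toy paths are hypothetical; the sketch only counts how often a signed feature set occurs on paths and does not reproduce the importance metrics defined in the paper.

```python
def prevalence(paths, signed_interaction):
    """Fraction of decision paths that contain every signed feature in the set.

    Hypothetical helper for illustration only.
    """
    target = frozenset(signed_interaction)
    hits = sum(1 for path in paths if target <= frozenset(path))
    return hits / len(paths)

# Toy decision paths: (feature, sign) pairs, with placeholder feature names.
paths = [
    [("x1", "+"), ("x2", "-"), ("x3", "+")],
    [("x1", "+"), ("x2", "-")],
    [("x1", "-"), ("x2", "-")],
    [("x1", "+"), ("x4", "+")],
]

# Same features, different signs: these are distinct signed interactions.
p_high_low = prevalence(paths, {("x1", "+"), ("x2", "-")})
p_low_low = prevalence(paths, {("x1", "-"), ("x2", "-")})
print(p_high_low, p_low_low)
```

Keeping the sign alongside the feature is what lets two paths that split on the same features in opposite directions be treated as describing different functional relationships.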

arXiv.org e-Print Archive

Definitions, methods, and applications in interpretable machine learning.

Author: W. James Murdoch
Chandan Singh
Karl Kumbier
Reza Abbasi-Asl
Bin Yu
Publication venue: eScholarship, University of California
Publication date: 01/10/2019

Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the predictive, descriptive, relevant (PDR) framework for discussing interpretations. The PDR framework provides 3 overarching desiderata for evaluation: predictive accuracy, descriptive accuracy, and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post hoc categories, with subgroups including sparsity, modularity, and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often underappreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.

Crossref

eScholarship - University of California

Domain-inspired machine learning for hypothesis extraction in biological data

Author: Kumbier Karl
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Rapidly moving technologies are transforming the rate at which researchers accumulate information. Large, rich datasets hold promises of new insights into complex natural phenomena that will help advance the frontier of science. Here we aim to develop new statistics/data science principles and scalable algorithms for extracting reliable and reproducible information from these data.

Chapter 1 provides an overview of the work contained in this thesis. It discusses the growing availability of genomic data and the statistical machine learning tools that are being used to provide a systems-level understanding of genomic phenomena.

Chapter 2 introduces the predictability, computability, and stability (PCS) framework. The PCS framework builds on key ideas in machine learning, using predictability as a reality check and evaluating computational considerations in data collection, data storage, and algorithm design. It augments predictability and computability with an overarching stability principle, which expands statistical uncertainty considerations to assess how results vary with respect to choices (or perturbations) made across the data science life cycle. In this chapter, we develop PCS inference through perturbation intervals and PCS hypothesis testing to investigate the reliability of data results. We compare PCS inference with existing methods in high-dimensional sparse linear model simulations to demonstrate that our approach compares favorably to others, in terms of ROC curves, over a wide range of simulation settings. Finally, we propose documentation based on R Markdown, iPython, or Jupyter Notebook, with publicly available, reproducible codes and narratives to justify human choices made throughout an analysis.

As an example of the PCS framework in practice, chapter 3 develops the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as Random Forests (RF). We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, 80% of the pairwise transcription factor interactions iRF identified as stable have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.

Chapter 4 refines iRF to explicitly map responses as a function of interacting features. Our proposed method, signed iRF (siRF), describes "subsets" of rules that frequently occur on RF decision paths. We refer to these rule subsets as signed interactions. RF decision paths containing the same signed interaction share not only a set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We formulate stable and predictive importance metrics (SPIMs) to rank signed interactions in terms of their stability, predictive accuracy, and strength of interaction. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate siRF in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of spatial gene expression patterns, siRF recovered all 11 reported links in the gap gene network. In the case of enhancer activity, siRF discovered rules that identify enhancer elements in Drosophila embryos with high precision, suggesting candidate biological mechanisms for experimental studies. By refining the process of interaction discovery, siRF has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.

eScholarship - University of California

Artificial intelligence and statistics

Author: Kumbier Karl
Yu Bin
Publication venue: eScholarship, University of California
Publication date: 01/01/2018

eScholarship - University of California